SUMMARY


Introduction 


This paper presents a comprehensive review of the research literature on
an aspect of probability assessment called "calibration." Calibration
measures the validity of probability assessments. Being well calibrated
is critical for optimal decision making and for the development of
decision-aiding techniques.


Background  and  Approach 


Subjective probability assessments play a key role in decision making.
It is often necessary to rely on an expert to assess the probability of some
future event. How good are such assessments? One important aspect of
their quality is called calibration. Formally, an assessor is calibrated
if, over the long run, for all statements assigned a given probability (e.g.,
the probability is .65 that "Romania will maintain its current relation
with People's China"), the proportion that is true is equal to the
probability assigned. For example, if you are well calibrated, then across all
the many occasions that you assign a probability of .8, in the long run 80%
of them should turn out to be true. If, instead, only 70% are true, you are
not well calibrated; you are overconfident. If 95% of them are true, you
are underconfident. The figure below shows calibration curves of
well-calibrated, overconfident, and underconfident assessors.

While this characteristic of assessors has obvious importance for applied
situations, people's calibration has rarely been discussed by decision
analysts or decision advisors. In the last few years, there has developed
an extensive literature about calibration, reporting both laboratory and
real-world experiments. It is now time to review this literature, to look
for common findings which can be used to improve decisions, and to identify
unsolved problems.

Findings 

Two general classes of calibration problem have been studied. The first
class is calibration for events for which the outcome is discrete. These
include probabilities assigned to statements like "I know the answer to
that question," "They are planning an attack," or "Our alarm system is
foolproof." For such tasks, the following generalizations are justified
by the research:

1. Weather forecasters, who typically have had several years of experience
in assessing probabilities, are quite well calibrated.

2. Other experiments, using a wide variety of tasks and subjects, show
that people are generally quite poorly calibrated. In particular, people
act as though they can make much finer distinctions in their degree of
uncertainty than is actually the case.

3. Overconfidence is found in most tasks; that is, people tend to
overestimate how much they know.

4. Despite the abundant evidence that untutored assessors are badly
calibrated, there is little research showing how and how well these
deficiencies can be overcome through training.

The second class of tasks is calibration for probabilities assigned to
uncertain continuous quantities. For example, what is the mean time between
failures for this system? How much will this project cost? The assessor
must report a probability density function across the possible values of
such uncertain quantities. The usual method for eliciting such probability
density functions is to assess a small number of fractiles of the function.
The .25 fractile, for example, is that value of the uncertain quantity such
that there is just a 25% chance that the true value will be smaller than
the specified value. Suppose we had a person assess a large number of .25
fractiles. He would be giving numbers such that, for example, "There is
a 25% chance that this repair will be done in less than x hours" or "There
is a 25% chance that Warsaw Pact personnel in Czechoslovakia number less
than x." This person will be well calibrated if, over a large set of such
estimates, the true value is less than x 25% of the time. The measures
of calibration used most frequently in research consider pairs of extreme
fractiles. For example, experimenters assess calibration by asking whether
98% of the true values fall between an assessor's .01 and .99 fractiles.
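
The two checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fractile_hit_rate` and `surprise_rate` and the sample numbers are hypothetical and do not come from the studies reviewed. The first function reports how often the true value fell below an assessed fractile (for .25 fractiles a calibrated assessor should approach .25); the second reports how often the true value fell outside a pair of extreme fractiles (for the .01 and .99 fractiles a calibrated assessor should approach .02).

```python
from typing import Sequence


def fractile_hit_rate(fractiles: Sequence[float],
                      true_values: Sequence[float]) -> float:
    """Proportion of items whose true value fell below the assessed fractile."""
    if len(fractiles) != len(true_values) or len(fractiles) == 0:
        raise ValueError("need equally long, non-empty sequences")
    below = sum(1 for f, v in zip(fractiles, true_values) if v < f)
    return below / len(fractiles)


def surprise_rate(lower: Sequence[float],
                  upper: Sequence[float],
                  true_values: Sequence[float]) -> float:
    """Proportion of items whose true value fell outside [lower, upper]."""
    if not (len(lower) == len(upper) == len(true_values)) or len(lower) == 0:
        raise ValueError("need equally long, non-empty sequences")
    outside = sum(1 for lo, hi, v in zip(lower, upper, true_values)
                  if v < lo or v > hi)
    return outside / len(true_values)


# Hypothetical data: four assessed .25 fractiles and the values later observed.
quartiles = [10, 20, 30, 40]
observed = [5, 25, 35, 45]
quartile_rate = fractile_hit_rate(quartiles, observed)

# Hypothetical data: four assessed (.01, .99) fractile pairs and true values.
low = [10, 20, 5, 100]
high = [20, 40, 9, 300]
truth = [15, 45, 7, 90]
extreme_rate = surprise_rate(low, high, truth)
```

With so few items the rates are far too unstable to mean anything; a real assessment of calibration needs a large set of estimates, as the text notes.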

For calibration of continuous quantities, the following results summarize
the research.

1. A nearly universal bias is found: assessors' probability density functions
are too narrow. For example, 20 to 50% of the true values lie outside the .01
and .99 fractiles, instead of the prescribed 2%. This bias reflects
overconfidence; the assessors think they know more about the uncertain quantities
than they actually do know.

2. Some data from weather forecasters suggest that they are not overconfident
in this task. But it is unclear whether this is due to training, experience,
special instructions, or the specific uncertain quantities they deal with
(e.g., tomorrow's high temperature).

3. A few studies have indicated that, with practice, people can learn to
become somewhat better calibrated.

Implications 

Since assessed probabilities are central to a wide variety of decision
problems (e.g., making intelligence estimates, assessing system reliability,
projecting costs, deciding whether to acquire more information), the question
of whether such probabilities are calibrated has far-reaching importance.
Almost all decision analyses involve probability assessments. If these
assessments are in error, the finest analysis relying on them may be faulty.

The bias towards overconfidence reported here is widespread and well documented.
What is not so well established is whether, and how, this bias can be overcome
through training. The superior performance of weather forecasters is
encouraging. These people have been using probabilities in their forecasts
on a daily basis for several years; one might assume that this experience
accounts for their excellence. Further research is needed to document just
how much training, with what kind of feedback, is most efficient for improving
assessors' calibration. Such research is crucial to developing a viable
decision analysis technology. It also helps tell us how much faith to put
in the probability assessments and decisions of untrained decision makers
working without the benefit of decision aids.

CALIBRATION OF PROBABILITIES: THE STATE OF THE ART

INTRODUCTION 

From the subjective point of view (de Finetti, 1937) a probability
is a degree of belief in a proposition whose truth has not been ascertained.
A probability expresses a purely internal state; there is no "right" or
"correct" probability that resides somewhere "in reality" against which it
can be compared. However, in many circumstances, it may become possible to
verify the truth or falsity of the proposition to which a probability was
attached. Today, we assess the probability of the proposition "it will
rain tomorrow." Tomorrow, we go outside and look at the rain gauge to see
whether or not it has rained. When verification is possible, we can use it
to gauge the adequacy of our probability assessments.

Assessors' adequacy has been discussed by Winkler and Murphy (1968a),
who identified two general kinds of "goodness": normative goodness, which
reflects the degree to which the assessments conform to the axioms of probability
and express the assessor's true beliefs, and substantive goodness, which reflects
the amount of knowledge of the topic area contained in the assessments. This
paper reviews the literature about the kind of adequacy called calibration.

If a person assesses the probability of a proposition's being true as .7,
and later finds that the proposition is false, that in itself does not invalidate
the assessment. However, if a judge assigns .7 to 10,000 independent propositions,
only 25 of which subsequently are found to be true, there is something wrong
with these assessments. The attribute which they lack we call calibration.

This attribute has also been called "realism" (Brown and Shuford, 1973),
"realism of confidence" (Adams and Adams, 1961), "appropriateness of confidence"
(Oskamp, 1962), "external validity" (Brown and Shuford, 1973), "secondary
validity" (Murphy and Winkler, 1971), and "reliability" (Murphy, 1973). Formally,
a judge is calibrated if, over the long run, for all propositions assigned a
given probability, the proportion that is true is equal to the probability
assigned. We can empirically evaluate judges' calibration by observing their
probability assessments, verifying the associated propositions, and then
observing the proportion that is true in each response category. Judges who
are not calibrated may be either underconfident or overconfident. For the
underconfident assessor, the proportion of propositions that are true is
greater than the probability assigned to them. With overconfidence, too few
propositions are true.

In this paper, we review the experimental literature on calibration,
separated somewhat arbitrarily into two sections. The first is devoted to
the calibration of assessors making probability judgments about discrete
propositions; the second, to calibration for probability density functions
concerning uncertain numerical quantities. The arbitrariness arises from
the fact that an uncertain quantity, for example, "the population of Brazil,"
can always be reworded into one or more discrete propositions, such as "the
population of Brazil exceeds 85 million." In a few cases, our decision about
which section an experiment should be discussed in depended more on how the authors
reported their data than on how their subjects perceived the task.

Calibration is essentially a property of single individuals. Most of
the results reviewed here, however, are grouped across subjects. Although
grouping is often necessary to secure the large quantities of data needed for
stable estimates of calibration, it can both obscure interesting individual
differences and cause serious biases in studies in which only a few items are
presented to many subjects. The experimenter who relies on but a few stimuli
may run the risk of inadvertently including a preponderance of items which
most subjects answer incorrectly (e.g., Are potatoes native to Ireland or
Bolivia? How many people live in Outer Mongolia?). With such "deceptive"
items, perfect calibration is impossible. A large number of items is one
protection against this problem.

DISCRETE  PROPOSITIONS 

Discrete propositions can be stated with any number of alternatives:

No alternatives: What is absinthe? The subject is asked to provide an
answer, and then to give the probability that his or her answer is correct.
The entire range of probability responses, from 0 to 1, is appropriate.
Only Adams (1957) has looked at calibration for this task.

One alternative: Absinthe is a precious stone. What is the probability
that this statement is true? Again, the relevant range of the probability
scale is 0 to 1.

Two alternatives: Absinthe is (a) a precious stone; (b) a liqueur.
With the "half-range" method, the subject first selects the more likely
alternative, and then states the probability that this choice is correct.
This response must be ≥ .5. With the "full-range" method, the subject gives
the probability that a prespecified alternative is correct. Here the subject
may use any response from 0 to 1.

Three or more alternatives: Absinthe is (a) a precious stone; (b) a
liqueur; (c) a Caribbean island; (d) . . . Two variations of this task
may be used: (1) the subject selects the single most likely alternative and
states the probability that it is correct, using a response ≥ 1/k for k
alternatives; (2) the subject assigns probabilities to all alternatives,
using the range 0 to 1. The second procedure induces dependencies in the data, by
requiring the k assessments to sum to 1.
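
As a small illustration of how the two-alternative response formats relate, the following Python sketch converts a full-range response into the equivalent half-range response. It assumes the assessor's two probabilities are coherent (they sum to 1), which the text requires only for the all-alternatives variant; the function name `full_to_half_range` and the labels "a" and "b" are hypothetical.

```python
from typing import Tuple


def full_to_half_range(p_a: float) -> Tuple[str, float]:
    """Convert a full-range response into a half-range response.

    p_a is the probability assigned to the prespecified alternative "a".
    Returns the alternative judged more likely and the probability that
    this choice is correct, which is always at least .5.
    Assumption: the probability of "b" is taken to be 1 - p_a.
    """
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    if p_a >= 0.5:
        return "a", p_a
    return "b", 1.0 - p_a


# A full-range response of .25 for "a" corresponds to choosing "b" with .75.
choice, confidence = full_to_half_range(0.25)
```

The conversion loses nothing for coherent assessors, but the two formats may still be perceived differently by subjects, so results from one should not be assumed to carry over to the other.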

For all these variations, calibration may be reported via a "calibration
curve." Such a curve is derived as follows: (1) Collect many probabilistic
responses to items whose correct answer is known or will shortly be known to
the experimenters. (2) Categorize the responses, usually within ranges; for
example, all responses between .60 and .69 are placed in the same category.
(3) Compute for each category the proportion correct, that is, the proportion
of items for which the proposition is true. (4) For each category, plot the
mean response against the proportion correct.
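
The four steps above can be sketched in Python as follows. This is a minimal illustration, not a procedure taken from any of the studies reviewed: the function name `calibration_curve` is hypothetical, the categories are assumed to be tenths (.60 to .69 and so on), and responses of exactly 1.0 are placed in the top category as a convenience.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def calibration_curve(responses: Sequence[float],
                      outcomes: Sequence[bool]) -> List[Tuple[float, float, int]]:
    """Return (mean response, proportion correct, count) for each category."""
    if len(responses) != len(outcomes):
        raise ValueError("responses and outcomes must have the same length")
    categories = defaultdict(list)
    for r, correct in zip(responses, outcomes):
        # Step 2: categorize by tenths; working in hundredths avoids
        # floating-point surprises such as 0.6 / 0.1 falling just below 6.
        hundredths = int(round(r * 100))
        tenth = min(hundredths // 10, 9)  # assumption: 1.0 joins .90-.99
        categories[tenth].append((r, bool(correct)))
    curve = []
    for tenth in sorted(categories):
        items = categories[tenth]
        n = len(items)
        mean_response = sum(r for r, _ in items) / n
        # Step 3: proportion of items in the category that were true.
        proportion_correct = sum(1 for _, c in items if c) / n
        curve.append((mean_response, proportion_correct, n))
    return curve  # Step 4: plot mean_response against proportion_correct


# Hypothetical responses and outcomes, for illustration only.
curve = calibration_curve(
    [0.60, 0.65, 0.69, 0.90, 0.90],
    [True, False, True, True, True],
)
```

Returning the count for each category is a design choice: categories with few responses give unstable proportions, and the count lets a reader judge how much weight each plotted point deserves.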

Several measures of overall calibration have been proposed. Murphy (1973)
has looked at the general case of k-alternative items. Each response,
i, is represented by a row vector of probabilities, $r_i = (r_{1i}, \ldots, r_{ki})$, and
the associated outcome by a row vector $c_i = (c_{1i}, \ldots, c_{ji}, \ldots, c_{ki})$, where $c_{ji}$
equals one for the true alternative and zero otherwise. Given response
vectors for N items from a single individual, the Brier (1950) scoring rule
(a proper quadratic scoring rule such that the smaller the score, the better)
is:

$$B = \frac{1}{N} \sum_{i=1}^{N} (r_i - c_i)(r_i - c_i)'$$

in which the prime denotes a column vector. Murphy partitioned this score
into three terms. The response vectors are sorted into T subcollections
such that all the responses in a subcollection are identical (each equal to
a common vector $r_t$). Let $n_t$ be
the number of responses in the t'th subcollection, and let $\bar{c}_t$ be the
proportion-correct vector for the t'th subcollection:

$$\bar{c}_t = (\bar{c}_{1t}, \ldots, \bar{c}_{jt}, \ldots, \bar{c}_{kt}), \quad \text{where } \bar{c}_{jt} = \frac{1}{n_t} \sum_{i \in t} c_{ji}.$$

Let $\bar{c}$ be the proportion-correct vector across all responses,

$$\bar{c} = (\bar{c}_1, \ldots, \bar{c}_j, \ldots, \bar{c}_k), \quad \text{where } \bar{c}_j = \frac{1}{N} \sum_{i=1}^{N} c_{ji},$$

and let u be the unity vector, a row vector whose k elements are all one.
Then Murphy's partition of the Brier score is:

$$B = \bar{c}(u - \bar{c})' + \frac{1}{N} \sum_{t=1}^{T} n_t (r_t - \bar{c}_t)(r_t - \bar{c}_t)' - \frac{1}{N} \sum_{t=1}^{T} n_t (\bar{c}_t - \bar{c})(\bar{c}_t - \bar{c})'$$

The first term measures the uncertainty inherent in the set of N items.
For example, if all items concern rain vs. no rain, this term reflects
how often it rained in fact. The second term, which Murphy called
"reliability," is a measure of calibration, the weighted sum of squares
of the difference between the responses and the proportion correct for
those responses. The third term, called "resolution," reflects the ability
of the assessor to sort the events into subcategories for which the hit
rate is maximally different from the overall hit rate.
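
The partition can be checked numerically. The Python sketch below assumes the partition has the form B = uncertainty + reliability − resolution, with the three terms as just described; the function name `murphy_partition` and the rain/no-rain data are hypothetical. Responses are grouped into subcollections of identical response vectors, as the partition requires.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def murphy_partition(responses: Sequence[Tuple[float, ...]],
                     outcomes: Sequence[Tuple[int, ...]]
                     ) -> Tuple[float, float, float, float]:
    """Return (Brier score, uncertainty, reliability, resolution)."""
    n = len(responses)
    k = len(responses[0])
    # Brier score: mean squared distance between response and outcome vectors.
    brier = sum(
        sum((r - c) ** 2 for r, c in zip(ri, ci))
        for ri, ci in zip(responses, outcomes)
    ) / n
    # Proportion-correct vector across all responses.
    c_bar: List[float] = [sum(ci[j] for ci in outcomes) / n for j in range(k)]
    uncertainty = sum(cj * (1.0 - cj) for cj in c_bar)
    # Subcollections of identical response vectors.
    groups = defaultdict(list)
    for ri, ci in zip(responses, outcomes):
        groups[tuple(ri)].append(ci)
    reliability = 0.0
    resolution = 0.0
    for r_t, outs in groups.items():
        n_t = len(outs)
        c_t = [sum(ci[j] for ci in outs) / n_t for j in range(k)]
        reliability += n_t * sum((r - c) ** 2 for r, c in zip(r_t, c_t))
        resolution += n_t * sum((c - cb) ** 2 for c, cb in zip(c_t, c_bar))
    return brier, uncertainty, reliability / n, resolution / n


# Hypothetical forecasts of (rain, no rain): five at (.8, .2), five at (.5, .5).
rain, dry = (1, 0), (0, 1)
forecasts = [(0.8, 0.2)] * 5 + [(0.5, 0.5)] * 5
observed = [rain, rain, rain, rain, dry, rain, rain, dry, dry, dry]
b, unc, rel, res = murphy_partition(forecasts, observed)
```

In this made-up example the .8 forecasts are perfectly calibrated (four of five verified), so all of the reliability term comes from the .5 forecasts, of which only two of five verified.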

Murphy (1974) has further suggested a "sample skill score" to measure
the skill of forecasters. This score, which constitutes a proper scoring
rule, is calculated by subtracting the second term in the partition,
calibration, from the third term, resolution. Assessors should maximize
this score.


Murphy's partition was designed for repeated predictions of the
same event, e.g., rain. When the items are diverse, as in a multiple-choice
examination, so that the alternatives can be identified only as
[...] and so forth, then the first term is
not meaningful; it is simply a function of the order in which the true alternatives
[...]


When the assessor is asked first which is the correct alternative, and
next what the probability is that the chosen alternative is correct, only
one response per item is scored. In these cases, Murphy's (1974) measure
reduces to what he has called (Murphy, 1972) the "special scalar partition":

$$B = \bar{c}(1 - \bar{c}) + \frac{1}{N} \sum_{t=1}^{T} n_t (r_t - \bar{c}_t)^2 - \frac{1}{N} \sum_{t=1}^{T} n_t (\bar{c}_t - \bar{c})^2$$

where $\bar{c}$ is the overall proportion correct, and $\bar{c}_t$ is the proportion correct
in the t'th subcategory. When the second response is the response ≥ .5 (as
with the two-alternative, half-range task), the first term does have an
interpretation: it reflects the subject's ability to pick the correct
alternative, and thus might be called "knowledge." The second term measures
calibration, and the third, resolution, as before.

This scalar measure of calibration, a weighted squared error, is similar
to measures proposed by Adams and Adams (1961), who used a "mean absolute
discrepancy score,"

[equation illegible in the source]

and by Oskamp (1962), who used an "appropriateness of confidence" scale:

$$\frac{1}{N} \sum_{t=1}^{T} n_t \, |r_t - \bar{c}_t|$$

Shuford and Brown (1975) also started with a proper scoring rule,
the logarithmic. In addition to computing a score for the assessor's
responses, S, they proposed fitting a least squares regression line to
the data in a calibration curve. The equation for the best-fitting line
can be used to externally recalibrate the assessor's responses, in order
to correct for systematic bias. One can then compute the score for these
recalibrated responses, $\hat{S}$. If M is the maximum score possible, then $M - \hat{S}$
measures the loss in score due to lack of knowledge, while $\hat{S} - S$ measures the
loss in score due to poor calibration.
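
A rough Python sketch of this recalibration idea follows. Several details are assumptions made for illustration, since the text above does not specify them: the score is taken to be the mean logarithm of the probability assigned to what actually happened on a single-response task, the calibration curve uses one point per distinct response value, the regression is unweighted, and recalibrated responses are clipped away from 0 and 1. The function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct response values")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def mean_log_score(responses: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean log probability assigned to the observed outcome.

    Assumes every response lies strictly between 0 and 1.
    """
    return sum(
        math.log(r if c else 1.0 - r) for r, c in zip(responses, correct)
    ) / len(responses)


def recalibrate(responses: Sequence[float], correct: Sequence[bool],
                eps: float = 1e-6) -> List[float]:
    """Map each response through the line fitted to the calibration curve."""
    groups = defaultdict(list)
    for r, c in zip(responses, correct):
        groups[r].append(c)
    xs = list(groups)                                   # distinct responses
    ys = [sum(v) / len(v) for v in groups.values()]     # proportion correct
    intercept, slope = fit_line(xs, ys)
    return [min(1.0 - eps, max(eps, intercept + slope * r)) for r in responses]


# Hypothetical overconfident assessor: .9 responses are right 6 times in 10,
# .6 responses are right 5 times in 10.
responses = [0.9] * 10 + [0.6] * 10
correct = [True] * 6 + [False] * 4 + [True] * 5 + [False] * 5
s = mean_log_score(responses, correct)
s_hat = mean_log_score(recalibrate(responses, correct), correct)
loss_from_poor_calibration = s_hat - s
```

Under this scoring convention the maximum possible score M is 0, so the knowledge loss would be 0 − Ŝ. Because the line is fitted to the same data it is then scored on, the recalibrated score is optimistic; a fairer check would fit the line on one set of items and score another.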

None  of  these  measures  of  calibration  have  as  yet  gained  acceptance  in 
the  research  literature.  None  discriminate  overconfidence  from  underconfidence. 
Nothing  is  known  about  the  sampling  properties  of  any  of  the  measures. 

Meteorological  Research 

In 1906, W. Ernest Cooke, Government Astronomer for Western Australia,
advocated that each meteorological prediction be accompanied by a single
number which would "indicate, approximately, the weight or degree of
[...] prediction." He reported (Cooke, 1906a, 1906b) results from 1,951 predictions [...]
Of those to which he had attached a weight of 5 ("almost certain to be
verified"), .985 were correct. For his weight of 4 ("normal probability"),
.938 were correct, while for his weight of 3 ("doubtful"), .787 were correct.


In 1951, Williams asked eight professional Weather Bureau forecasters in
[...] .2, .4, .6, .8, or 1.0 with each
12-hour forecast of precipitation. The calibration curve for 1,095 predictions
appears in Figure 1. These assessments of the probability of precipitation were
too high throughout most of the range. This might be the
result of a fairly natural form of hedging in public pronouncements. People
are much less likely to criticize a weather forecast that leads them to carry
an umbrella when it does not rain than one that leads them to be without an
umbrella when it rains.


Similar results emerged from two studies of forecasters reported by
Murphy and Winkler (1974). One of their studies dealt with the effect of
a computerized weather prediction system (PEATMOS) on forecasters' assessments.
The task was to assess the probability of precipitation the following day.
Forecasters did this twice, before and then again after seeing the PEATMOS
output. Data were collected in Great Falls, Montana, and Seattle, Washington.
All 7,188 assessments (before and after PEATMOS in both cities) were combined
to produce the calibration curve in Figure 1, which shows the same
overestimation of the probability of rain.


In the other study forecasters were asked to predict the next day's
high temperature. Two forecasters used a "fixed-width, variable-probability"
technique. First, they named the median temperature. Then they stated the
probability that the temperature would fall within intervals of 5° F and 9° F
centered at the median. Such a technique converts a continuous probability
distribution into a two-alternative discrete task: the temperature is scored
as falling either within or outside of the stated interval. Calibration for
241 such assessments is shown in Figure 1. These forecasters, who could
have used any probability between 0 and 1, responded below .4 on only three
occasions (excluded from the curve). Again, we see a systematic bias across
the entire range covered: the probability associated with the temperature
falling inside the interval is always too large. Better calibration was
reported by Sanders (1958), who collected 12,635 predictions of a variety
of dichotomized events: wind direction, wind speed, gusts, temperatures,
cloud amount, ceiling, visibility, precipitation occurrence, precipitation
type, and thunderstorm, using the eleven responses 0, .1, ..., .9, 1.0.
The resulting calibration curve is shown in Figure 1.¹

In contrast to the meteorological studies showing a constant bias
across almost the entire response range, Root (1962) has reported calibration
for 4,138 precipitation forecasts which shows (see Figure 1) a more systematic
pattern. Here, assessed probabilities were too low in the low range and too
high in the high range, relative to the observed frequencies. This pattern
indicates overconfidence both for the proposition, "It most likely will rain,"
and for the proposition, "It most likely won't rain."

Figure 2 shows calibration curves for one year of precipitation
probability forecasts from Hartford, Connecticut (Winkler and Murphy, 1968b).
These forecasters had the option of forecasting for either a six-hour period
or a twelve-hour period. They made 3,174 six-hour forecasts and 2,936
twelve-hour forecasts; these data are shown separately. There was some
ambiguity about whether the forecasters had intended to include or exclude
"a trace of precipitation" (less than .01 inches) in their predictions of
precipitation. Accordingly, the data were analyzed twice, once assuming that
"precipitation" included the occurrence of traces, and once assuming that
"precipitation" did not include traces. The inclusion or exclusion of
traces had a substantial effect, as did the choice of time period. Six-hour
forecasts were associated with lower observed frequencies than were
twelve-hour forecasts. Thus the forecasters were found to assess precipitation
probabilities that were too high for the six-hour, traces excluded case, and
too low for the twelve-hour, traces included case, relative to the observed
frequencies, while the other two cases showed very good calibration.

¹ The references by Cooke (1906), Williams (1951), and Sanders (1958) were
brought to our attention through an unpublished manuscript by Howard Raiffa,
dated January, 1969, entitled "Assessments of probabilities."

The United States Weather Bureau (1969) has collected massive amounts of
calibration data for precipitation forecasts made from April, 1967, to
March, 1968, at sites all over the country. Figure 3 shows just one-fourth of
these data (the rest of the data were highly similar); each curve is based
on more than 16,800 forecasts. The solid-line curve is for forecasts for
the first time period, that which immediately followed the time the forecast
was made. Here, calibration was excellent, with a mean absolute error of
only .03. As the lag between the time the forecast was made and the period
it referred to increased, calibration deteriorated. This deterioration was
not as great as it appears in the figure, because in the later periods
forecasters used fewer responses in the high range. Thus, even for the third
period the mean absolute error was only .05. Murphy² believes that these
data more accurately represent the current performance of weather forecasters
than do the data in Figures 1 and 2. He attributes the superior performance
in the present report to the increased experience with probabilities that [...]

Early Laboratory Research

In 1957, Joe Adams published a paper using an eleven-point "confidence
scale" with a zero-alternative task. His subjects were trained to use, not
probabilities, but a scale defined to them in precisely the way we have
defined calibration: "[the subject was] instructed to express his confidence
in terms of the percentage of responses, made at that particular level of
confidence, that he expects to be correct. . . . Of those responses made with
confidence p, about p% should be correct" (p. 432-3).

Each of forty words was presented tachistoscopically ten times
successively, with increasing illumination each time, to ten subjects.
After each exposure subjects wrote down the word they thought they saw,
and gave a confidence judgment, limited to the numbers 0, 10, 20, . . . 90,
100. The resulting calibration curve, across subjects, is shown in Figure 4.
Great caution must be taken in interpreting the data: because each word
was shown 10 times, the responses are highly interdependent. It is unknown
what effect such interdependence has on calibration, but the finding of gross
underconfidence along the entire response scale has been replicated with only
one subject in one experiment (Swets, Tanner and Birdsall, 1961). Perhaps
subjects were "holding back," unwilling to give a high response when they
knew that the same word would be presented several more times.

The following year Adams and Adams (1958) reported a training experiment
using the same response scale, but a new, three-alternative, single-response
task: For each of 156 pairs of words per session, subjects were asked
whether the words were antonyms, synonyms, or unrelated. Thirteen of the
14 experimental subjects, who were shown calibration tallies and calibration
curves after each of five sessions, had lower discrepancy scores on the fifth
day than on the first. The mean decrease for the 14 subjects was 48%. Six
control subjects, whose only feedback was a tally of their unscored responses,
showed a 36% mean increase in discrepancy scores. Figure 4 shows the
calibration, grouped across all five sessions, for one experimental subject,
the only subject for whom Adams and Adams reported such data.

In a 1961 Psychological Review article, Adams and Adams discussed many
aspects of the calibration of probabilities (using the term "realism of
confidence"), anticipating much of the work done by others in recent years,
and presented more bits of data, including the grossly overconfident calibration
curve of a schizophrenic who believed he was Jesus Christ. They reported
calibration curves from a nonsense syllable learning task with large
overconfidence after one trial and improvement after 16 trials. They also
described briefly a "transfer of training" experiment: On the first day,
subjects made 108 decisions about the percentage of blue dots in an array of
blue and red dots. On the second and fourth days, the subjects decided on
the truth or falsity of 250 general statements. On the third day, they lifted
weights blindfolded. On the fifth day, they made 256 decisions (synonym, antonym,
or unrelated) about pairs of words. Eight experimental subjects, given
calibration feedback during the first four days, showed on the fifth day a
mean absolute discrepancy score significantly lower than that of eight control
(no feedback) subjects, suggesting some transfer of training. Finally,
Adams and Adams reported a correlation of .36 between absolute discrepancy
scores and fear of failure (achievement anxiety) for 56 subjects taking a
multiple choice final examination in elementary psychology. Neither over-
nor underconfidence nor knowledge was related to fear of failure, only
calibration.

One can suppose that, having originated such a wide range of thoughtful
ideas, the Adamses sat back to watch the procession of further work on the
topic. If so, they may still be waiting. Except for the study by Oskamp
(1962) described next, no other work appeared for over ten years, and of all
the other literature reviewed in this paper, not a single author has referenced
the Adamses' or Oskamp's work!

Oskamp (1962) used 200 MMPI³ profiles as stimuli. Half the profiles
were from men admitted to a VA hospital for psychiatric reasons; the others
were from men admitted for purely medical reasons. The task was to decide,
for each profile, whether the patient's status was psychiatric or medical,
and state the probability that the decision was correct, using the half-range
method. Each profile had been independently categorized as hard (61 profiles),
medium (88), or easy (51) on the basis of an actuarially derived classification
system, which correctly identified 57%, 69%, and 92% of the hard, medium, and
easy profiles, respectively.

Three groups of subjects judged all 200 profiles: 28 undergraduate
psychology majors, 23 clinical psychology trainees working at a VA hospital,
and 21 experienced clinical psychologists. The 28 inexperienced judges
were later split into two matched groups, and given the same 200 profiles
again. Half were trained to improve accuracy: after the first 50 repeated
profiles, they were told their percent correct for the first 200 and the
just-completed 50, and instructed in the use of four simple actuarial rules
(e.g., if the F-scale is 55 or higher, call the profile psychiatric). For
profiles 51 through 100, they received right/wrong feedback after every 10
profiles. They received no feedback during profiles 101-200. The other
inexperienced judges received calibration training during their second session.
After every 50 profiles, they were told their percent correct, their
calibration score, their rank within the group on both these measures, and
shown their calibration curve. The experimenter suggested and discussed
ways of improving each subject's calibration.

^3 The MMPI (Minnesota Multiphasic Personality Inventory) is a personality
inventory widely used for psychiatric diagnosis. A profile is a graph of
13 sub-scores from the inventory.

Oskamp used three measures of subjects' performance: accuracy (percent
correct), confidence (mean probability response), and appropriateness of
confidence (a calibration score: $\frac{1}{N}\sum_t n_t\,|r_t - c_t|$). All three groups were,
in general, overconfident, especially the undergraduates in their first
session (accuracy 70%, confidence .78). However, all three groups were
underconfident on the easy profiles (accuracy 87%, confidence .83).
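
As an illustration only (it is not taken from Oskamp's report), the three measures can be sketched in a few lines of Python. The sketch assumes that the calibration score is the weighted mean absolute difference between each probability response r_t and the proportion correct c_t among the n_t items given that response; the function name and the data are hypothetical.

```python
from collections import defaultdict

def oskamp_measures(responses, outcomes):
    """Return (accuracy, confidence, calibration_score) for half-range data.

    responses: stated probabilities (.5 to 1.0) that the chosen answer is correct.
    outcomes:  1 if the chosen answer was correct, else 0.
    """
    n_total = len(responses)
    accuracy = sum(outcomes) / n_total        # proportion correct
    confidence = sum(responses) / n_total     # mean probability response
    # Group the outcomes by the stated probability r_t.
    groups = defaultdict(list)
    for r, x in zip(responses, outcomes):
        groups[r].append(x)
    # Weighted mean absolute gap between r_t and the proportion correct c_t.
    score = sum(len(xs) * abs(r - sum(xs) / len(xs))
                for r, xs in groups.items()) / n_total
    return accuracy, confidence, score

# Hypothetical data: ten items answered at .6 and ten items answered at .9.
responses = [0.6] * 10 + [0.9] * 10
outcomes = [1] * 6 + [0] * 4 + [1] * 7 + [0] * 3
acc, conf, cal = oskamp_measures(responses, outcomes)
```

With these made-up data the assessor is perfectly calibrated at .6 but overconfident at .9, so mean confidence exceeds accuracy and the calibration score is greater than zero.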

The subjects trained for accuracy increased their accuracy from
67% to 73%, closer to their confidence, .78, which did not change as a
result of training.^4 Their calibration score decreased from .17 to .10.

The subjects trained for calibration lowered their confidence from .78
to .74, bringing it closer to their accuracy, .68, which remained unchanged.
Their calibration score decreased from .15 to .11.

^4 MMPI-buffs might note that with this minimal training the undergraduates
showed as high an accuracy as either the best experts or the best actuarial
prediction systems.

Signal Detection Research

In the early days of signal detection research, investigators looked
into the possibility of using confidence ratings rather than Yes-No
responses in order to reduce the amount of data required to determine a
stable ROC (receiver operating characteristic) curve. The classic Psychological
Review paper by Swets, Tanner and Birdsall (1961, the same volume in which the
Adamses' review appeared) reported individual calibration curves for
four observers who used a six-point rating scale to indicate their confidence
that they had heard a signal plus noise rather than noise alone. The ratings
were defined on a probability scale, the first point representing 0.0 to 0.04,
the next 0.05 to 0.19, followed by four equal-width categories, 0.20-0.39,
0.40-0.59, 0.60-0.79, 0.80-1.00. The calibration curves of the four subjects,
based on 1,200 trials each, are shown in Figure 5. The individual differences
are striking, with only one subject being even remotely well calibrated.

Clarke (1960) reported an experiment in which one of five different
words, mixed with noise, was presented to listeners through headphones.
The listeners selected the word they thought they heard and then rated their
confidence by indicating one of five categories defined by slicing the
probability scale into five ranges. Twelve practice tests of 75 items each
helped the listeners to calibrate themselves. After each test, listeners
scored their own results and noted whether the appropriate percentage of
correct identifications fell in each rating category, thus allowing them to
change strategies on the next test. Clarke found that although all five
listeners appeared well calibrated when data were averaged over the five
stimulus words, analyses for individual words showed that the listeners
tended to be overconfident for low-intelligibility words and underconfident
for words of relatively high intelligibility. As we show in the next section,
this pattern of findings, overconfidence for difficult items and underconfidence
for easy items, has been obtained in different tasks.

Clarke also reported an experiment in which both the signal-to-noise
ratio and the number of alternatives were varied. He found that the calibration
curves for different signal-to-noise ratios were nearly identical when only
four words made up the message set. But when any one of 16 words was
possible, the curves appeared well calibrated only for the larger
signal-to-noise ratios, deteriorating to overconfidence at smaller signal-to-noise
ratios. In spite of their training in using the rating scale, the listeners
adopted different response criteria for different stimulus characteristics,
thereby shifting their calibration curves.

Pollack and Decker (1958) used a verbally defined 6-point confidence
rating scale that ranged from "Positive I received the message correctly"
to "Positive I received the message incorrectly." With this rating scale
it is impossible to determine whether an individual is well calibrated, but
it is possible to see shifts in calibration across conditions. In seeming
contrast to Clarke's results, Pollack and Decker showed that the average
calibration curve over three subjects remained unchanged with different
signal-to-noise ratios. However, when subsets of difficult items, medium items and
easy items were analyzed separately, the invariance of the calibration
curves disappeared. Calibration curves for easy words generally lay above
those for difficult words, whatever the signal-to-noise ratio, and the curves
for high signal-to-noise ratios lay above those for low signal-to-noise
ratios, whatever the word difficulty.

In another experiment on message reception, Decker and Pollack (1958)
varied the frequency cutoffs for the noise that was mixed in with the
speech. For one subject, calibration was unaffected by the change in filters,
but for the other two subjects, the calibration curve for the lower-frequency
filter was below that for the other filter. Here, the effect of task
difficulty on calibration depended on the individual.

In most of these studies, shifts in calibration curves were of secondary
interest: the important question was whether confidence ratings would yield
the same ROC curves as Yes-No procedures. To answer this question, it is
not necessary to define rating scales in terms of probabilities;
verbally-defined categories are sufficient. Thus, the probability scale disappeared
from signal detection research. By 1966, Green and Swets concluded that,
in general, rating scales and Yes-No procedures yield almost identical ROC
curves. Since then, studies of calibration have disappeared from the signal
detection literature.

Recent  Laboratory  Research 

Hazard and Peterson (1973) found no effect on calibration due to
changes in response mode. Forty subjects, armed forces personnel studying
at the Defense Intelligence School, responded with probabilities, and with
odds, to 50 two-alternative general knowledge items (e.g., which magazine had
the largest circulation in 1970, Playboy or Time?), using the half-range
method. Substantial overconfidence was found, as shown in Figure 6.
Lichtenstein (unpublished) replicated the results, using the same items but
only the probability response, with 19 Oregon Research Institute employees.
Phillips and Wright (in press) found similar results with different items,
using British undergraduate students as subjects. The calibration curves
shown in Figure 6 look remarkably similar considering the variety of subject
populations employed; all showed gross overconfidence.

Using the same half-range, two-alternative method, we have recently
conducted a series of experiments exploring calibration (Lichtenstein and
Fischhoff, 1976). We will briefly review our findings here.

In two tasks chosen to be extremely difficult, subjects were poorly
calibrated; in fact, they showed no evidence of calibration at all. Figure 7
shows curves for these tasks, one in which subjects were asked to identify
small sketches as drawn by European or Asian children, and one in which they
studied stock market charts and were asked to predict whether the stock
described by each chart would be up or down 3 weeks hence. Overall percent
correct was 53% for children's art, 47% for stocks.^

^ We caution the reader against trying to interpret the fascinating shape (a
fish?) created by these two calibration curves. We think it's a fluke of chance.

Even a small amount of substantive knowledge will induce some
improvement in calibration. We asked two other groups of subjects whether
each of 10 examples of handwriting was written by a European or an American,
after they had studied 10 similar examples. All examples were preselected
to be difficult to judge. The training group's study examples were correctly
labeled as to country of origin; the no-training group's study examples
were unlabeled. As shown in Figure 8, the training group, who correctly
identified 71% of the handwriting examples, were much better calibrated
than the no-training group (51% correct).

[Figure 6: Calibration for Half-Range Tasks. Curves: Hazard & Peterson, 1973 (probabilities); Hazard & Peterson, 1973 (odds); Phillips & Wright (this volume); Lichtenstein (unpublished). Horizontal axis: Subjects' Response.]

We pursued the notion that substantive knowledge affects calibration in
several additional studies using two-alternative general knowledge items.
Substantive knowledge was defined for subjects by the proportion of items
they correctly answered (best or worst subjects) and for items by the
proportion of correct answers, across subjects, for each item (easy or hard
items). Figure 9 gives results for 50 graduate students pursuing Ph.D.'s in
psychology. A replication using different items and a different sample of
subjects, undergraduate student volunteers, showed similar results (not
graphed here).

These curves clearly show that the degree of over- or under-confidence
is a function of substantive knowledge. The most knowledgeable subjects
answering the easiest items showed substantial underconfidence, while
the worst subjects on the hardest items showed substantial overconfidence.

The relationship between item difficulty and over- or under-confidence is mediated
by the distribution of responses given by subjects. To be well calibrated
with hard items, an assessor must use many responses of .5 and .6 and a few
of .9 and 1.0, while with easy items the reverse must be true to achieve
good calibration. The distributions of responses for the four calibration
curves shown in Figure 9 indicate that the subjects did change their
distributions, but not as much as they should have. Across 16 different experiments
or sub-experiments we have run (Lichtenstein and Fischhoff, 1976) using
two-alternative half-range tasks, there is a .91 correlation between the
mean response over all subjects and items (range .65 to .86) and the percent
correct over all subjects and items (range 43 to 92), giving further
evidence that subjects do change their response distributions as the
difficulty level of the task changes, though not enough to achieve good
calibration.

[Figure caption: Calibration for Handwriting Identification: Training versus No Training]

[Figure caption: Calibration for Subsets]

The calibration curves shown in Figure 9 were not calculated from
separate, independent sets of data, but from subsets of items embedded in
a larger set, the longer test given to each subject. To guard against the
possibility that there is some artifactual reason for these findings, due
perhaps to an adaptation level effect operating in the larger, more varied
tests the subjects actually took, we prepared two tests, one hard (50 items)
and one easy (50 items), using items that had previously been used in a
large, varied test. These smaller tests were given to two new groups of
subjects: 48 subjects took the hard test, 45 the easy. Figure 10 shows
that the calibration from these two separate, independent tests was
essentially the same as calibration calculated from sub-tests created
artificially (and post hoc) from a larger set of data. The effect of test
difficulty shown here is not an artifact due to our method of analysis.

Using a full-range, one-alternative task, Pitz (1974) found an
item-difficulty effect similar to that reported above. He gave 38 subjects 12
items concerning the population of various countries (e.g., "the population
of Brazil exceeds 85 million"), and an unspecified number of items concerning
the grade each would receive in Pitz's course, one week before the final
exam. The population items were chosen to be difficult, the course grade
items easy. The divergence of the two calibration curves is apparent (see
Figure 11).

While Pitz did not report percent correct for either group, his "hard
item" calibration curve is similar to data Fischhoff and Lichtenstein (in
preparation) have collected with the two-alternative, full-range method (see
Figure 11). In our study, 100 two-alternative items were given to 131
subjects. Half the subjects were told to assess the probability that the
first alternative was correct; the other half responded to the second
alternative. The data from the two groups were combined. The test items
were composed of two subsets, one with 75 items of moderate difficulty
(65% correct)^ and one with 25 items of greater difficulty (55% correct).
Clearly, the pattern of Pitz's results for hard items was repeated; the
calibration was abysmal.

[Figure 10: Calibration for Hard and Easy Tests Versus]

[Figure 11: Calibration for Several Full-Range Studies]

Perhaps the categorization of items into "hard" and "easy" does not
really capture the essence of expertise. Experts might be better calibrated
not only because they know the correct answer for more of the items, but
also because they have thought more about the whole topic area, and thus
can more readily recognize the extent and the limitations of their knowledge.
We tested this hypothesis, using psychology graduate students as our experts.
They responded to 100 items, 50 dealing with knowledge of psychology and
50 dealing with general knowledge. The two parts of the test were analyzed
separately. The percent correct was the same (76%) for the two parts.
Since item difficulty was controlled for, differences in calibration could
only be attributed to the hypothesized quality of insight that experts might
have above and beyond their level of knowledge. As shown in Figure 12, no
such differences were found.

^ In the full-range method, percent correct is calculated as follows: when
the subject responds with a probability > .5, we count the successes; when
the response is .5, we count half the responses, under the assumption that the
subject, when asked to choose which of two alternatives is the preferred one,
would randomly make that choice. When the response is < .5, we count the
failures: if you say the probability of rain tomorrow is .1, and it doesn't rain,
then you were correct in believing it would more likely not rain than rain.
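
The counting rule described in this note can be written out as a short Python sketch. The function name and the five example responses are hypothetical and serve only to show how the three cases are scored.

```python
def full_range_percent_correct(responses, outcomes):
    """Percent correct for full-range responses.

    responses: probability assigned to the designated alternative (0 to 1).
    outcomes:  1 if that alternative turned out to be true, else 0.
    """
    credit = 0.0
    for r, x in zip(responses, outcomes):
        if r > 0.5:
            credit += x        # response above .5: count the successes
        elif r < 0.5:
            credit += 1 - x    # response below .5: count the failures
        else:
            credit += 0.5      # response of exactly .5: count half
    return 100.0 * credit / len(responses)

# Hypothetical responses and outcomes for five items.
pct = full_range_percent_correct([0.9, 0.1, 0.5, 0.7, 0.2], [1, 0, 1, 0, 1])
```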


Finally, we looked at the effect of intelligence on calibration. Our
usual volunteers were mostly undergraduate college students. Our graduate
student subjects may be presumed to be significantly more intelligent, as
a result of highly selective admissions procedures. Figure 13 shows the
calibration of two subtests of 73 items. The subtests were chosen from
previously collected data so that each item from the usual volunteers was
matched in difficulty (% correct) by an item from the graduate students.
The graduate students appear to be slightly better calibrated at .5 and 1.0.
The differences are slight, however, when compared with differences in
calibration due to test difficulty.

Data from two full-range studies are shown in Figure 14. Fischhoff and
Beyth (1975) asked 150 Israeli university students to assess the probability
of 15 then-future events, possible outcomes of President Nixon's
much-publicized trips to China and Russia. Examples of the events are "President
Nixon will meet Mao at least once"; "The USA and the USSR will agree to a
joint space program"; "President Nixon will announce that his trip was
successful." The resulting calibration curve, based on 1,921 assessments, is
suboptimal at 0 and 1, and shows a dip at .7, but is otherwise remarkably
close to the identity line. Why? The subjects received the usual instructions.
They were not experienced in probability assessments. They were run in large
classroom groups. They were not foreign-affairs experts. Is this ability a
special attribute of Israelis?

Sieber (1974) had 20 subjects assess probabilities for all four
multiple-choice alternatives of 20 items in a college classroom exam. All
1600 responses are included in this curve. A large proportion of the
responses (77%) were of the form (1, 0, 0, 0), and for these responses the
calibration was superb: the percent correct was 98.7. The rest of the
curve (see Figure 14) is based on few data. It is difficult to know to
what extent the apparent symmetry about the point (1/4, 1/4) is forced on
the curve by the inclusion of all four responses to each item.


The primary purpose of Sieber's experiment was to study the effect of
motivation on calibration. The subjects whose data are plotted here were
told that the score they earned on the test (based on a proper scoring rule)
would not count in their grade. Another group was told their score would
count in their grade. The latter (highly motivated) group used (1, 0, 0, 0)
for 90% of their responses. Their calibration (not plotted here) appears
worse, but so few data are available for the curve (aside from the end
points) that one should be cautious in drawing any conclusion.

In a stock market prediction task, Staël von Holstein (1972) asked
subjects to assess probabilities for a five-alternative task: the future
movement of stocks categorized into five intervals fixed by the experimenter.
He did not report the data necessary to compute a calibration curve, except
to note, tantalizingly, that of 7,896 distributions only 40 were of the
extreme form (1, 0, 0, 0, 0). Of these, only 12 were correct!

The full-range studies based on laboratory research, shown in Figures
11 and 14, indicate symmetric calibration: the proportion correct for any
response r is approximately equal to one minus the proportion correct for
the response 1-r. In contrast, the full-range calibration curves from the
weather forecasting studies shown in Figures 1 and 2 are not (except for
Root, 1962) symmetric: they show a constant bias across the entire range.
It is tempting to believe that whether a calibration curve shows symmetry
or bias depends on the implicit payoff structure for different kinds
of error. Forecasters may prefer to forecast rain and be wrong rather than to
forecast no rain and be wrong. But it seems unlikely that laboratory
subjects perceive differential penalties for saying absinthe is a liqueur
and finding out it is a precious stone versus saying it is a precious
stone and finding out it is a liqueur.

Some Problems


A rarely-discussed problem in measuring an assessor's calibration is
the large number of assessments needed to provide a stable estimate. One
way to reduce the number of responses required is to assume that the calibration
curve is one of a family of curves, and use the data to estimate the parameters
of the curve. Shuford and Brown (1975; see also Brown and Shuford, 1973)
assumed that calibration curves are straight lines, and found least squares
estimates of the slope and intercept for each subject. The model becomes a
one-parameter (slope) model when, for n items with k alternatives, the subject
gives responses to all alternatives and all nk responses are fitted by the
model. Provided that the sum of the k responses to a single item is always
1.0, the fitted line is constrained in their model to pass through the point
(1/k, 1/k). Using 3-alternative items, Shuford and Brown reported, without
supporting detail, that "as long as a reasonably wide range of [responses]
is used by the [subjects], this estimation procedure can yield fairly stable
results with 15- and 20-item tests" (1975, p. 157). However, the authors
were concerned that their model assumes that all responses are independent,
and suggested that when more than two alternatives are used, this might not
be true because "some people might tend to overvalue information when deducing
reasons in favor of an answer, but tend to undervalue information when
deducing reasons against an answer" (p. 157). To solve this problem, they
proposed a planar least-squares estimation procedure for the special case of
three alternatives. The planar model, however, did not produce stable
estimates for small numbers of items.^
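
The constraint through (1/k, 1/k) can be checked numerically. An ordinary least-squares line always passes through the point of means, and when every item's k responses sum to 1.0 and exactly one alternative is true, both the mean response and the mean outcome equal 1/k. The following Python sketch uses a plain least-squares fit, which is an assumption here (Shuford and Brown's own computations may have differed in detail), and the five 3-alternative items are hypothetical.

```python
def fit_line(responses, outcomes):
    """Ordinary least squares: outcome = intercept + slope * response."""
    n = len(responses)
    mean_r = sum(responses) / n
    mean_x = sum(outcomes) / n
    sxy = sum((r - mean_r) * (x - mean_x) for r, x in zip(responses, outcomes))
    sxx = sum((r - mean_r) ** 2 for r in responses)
    slope = sxy / sxx
    return slope, mean_x - slope * mean_r

# Hypothetical 3-alternative items: (responses summing to 1, index of true answer).
items = [
    ((0.8, 0.1, 0.1), 0),
    ((0.6, 0.3, 0.1), 1),
    ((0.5, 0.25, 0.25), 0),
    ((0.9, 0.05, 0.05), 0),
    ((0.4, 0.4, 0.2), 2),
]
k = 3
# Use all n*k responses, scoring each alternative as true (1) or false (0).
responses = [p for probs, _ in items for p in probs]
outcomes = [1 if j == true else 0 for probs, true in items for j in range(k)]
slope, intercept = fit_line(responses, outcomes)
value_at_1_over_k = intercept + slope / k   # height of the fitted line at r = 1/k
```

Because the intercept is fixed once the slope is known, only the slope remains to be estimated, which is why the model has a single free parameter.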

^ T. A. Brown, personal communication, March 3, 1975.

Schlaifer (1971), in his MANECON program called TRUCHANCE, proposed a
one-parameter model which is linear in the log of the odds of the response
(r) plotted against the log of the odds of the proportion correct (c):

$$\log\frac{c}{1-c} = A + \log\frac{r}{1-r}$$

His program uses a Bayesian approach to finding the posterior distribution of
the parameter A, given a set of responses, and uses that distribution to
recalibrate future responses. This model is somewhat limited. The only
forms of miscalibration it can recognize are curves always above the diagonal
or always below it. Such a model could not adequately represent the symmetric
full-range data shown in Figure 1 (Root, 1962) and Figure 11.

We have recently been exploring the use of models to improve the
stability of estimates of calibration (Phillips and Lichtenstein, in
preparation), using both a two-parameter linear model and a two-parameter expansion
of Schlaifer's model:

$$\log\frac{c}{1-c} = A + B\,\log\frac{r}{1-r}$$
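
To make the two-parameter log-odds model concrete, the following Python sketch fits A and B by ordinary least squares on the log-odds scale and then uses the fitted line to recalibrate a response. This is an illustration under stated assumptions: the text does not specify the fitting method, Schlaifer's program is described as Bayesian rather than least squares, and the grouped data and function names are hypothetical.

```python
import math

def log_odds(p):
    """Natural log of the odds p / (1 - p); p must lie strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def fit_log_odds_model(r_values, c_values):
    """Least-squares A and B in log_odds(c) = A + B * log_odds(r)."""
    xs = [log_odds(r) for r in r_values]
    ys = [log_odds(c) for c in c_values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def recalibrate(r, a, b):
    """Map a stated probability r to the proportion correct implied by the fit."""
    z = a + b * log_odds(r)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical grouped data: response categories and observed proportions correct.
r_values = [0.2, 0.35, 0.5, 0.65, 0.8]
c_values = [0.35, 0.42, 0.5, 0.58, 0.65]
a, b = fit_log_odds_model(r_values, c_values)
```

In this parameterization A = 0 and B = 1 correspond to perfect calibration. The made-up data are symmetric about .5, so the fit gives A near zero and B below one, a pattern that the one-parameter model (B fixed at 1) cannot represent.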

We are less sanguine than Shuford and Brown about the number of items
required for stable estimation. Consider an assessor who is so badly
calibrated that she says .2 when she ought to say .35, and says .8 when she
ought to say .7. Preliminary results with simulated data indicate that the
probability that such an assessor will appear to be perfectly calibrated can
be as high as .5 for a 100-item test.

The need for accurate estimates of calibration with the fewest possible
data is most pressing when one considers the problem of training an assessor
to become better calibrated. An obvious design for a training experiment
would be to run a subject for, say, eight sessions. At the end of each
session, we would give her feedback, telling her about her calibration and
urging her to improve it. If we collect too few data per session, we stand
a large chance of giving her false feedback, telling her, for example, that
she is consistently underconfident, when in fact she is really overconfident.
In addition, the experimenter in such a study would have little power (in
the statistical sense) to conclude, after the experiment, that training led
to improvement. On the other hand, preparing and presenting 800 to 1600
stimuli (100 to 200 per session) presents problems for both the experimenter
and the subject.

Brown and Shuford (1973) have suggested two ways of dealing with this
problem: (1) Give subjects scoring-rule feedback after every item. This might
serve to keep subjects interested and learning. (2) Give calibration
feedback after every N items. This feedback would be the straight line
fitted to the data. They further suggest that all responses to each item,
not just one response, be fitted. We believe that using all the data might
work for those situations where a constant bias is unlikely, such as when
using diversified items of general information. But when the items are
repeated presentations of the same question, such as "Will it rain tomorrow?",
the inclusion of both responses to each item would tend to obscure the kind
of bias shown in Figures 1 and 2.

One further problem in training assessors is the possibility that the
assessor will trade off information transmission for calibration. At the
extreme, an assessor could always respond with the base rate (the overall
proportion of correct propositions), thus yielding excellent calibration
but no information. To avoid this strategy, it might be wise to feed back
to the trainee Murphy's vector partitions of the scoring rule (or, where
appropriate, the special scalar partitions) at the end of every session. Hopefully,
the subject would learn to improve the calibration portion of the score without
greatly decreasing the resolution portion. In addition, one would wish to
show the trainee, perhaps via a calibration curve smoothed by a fitted model,
whether poor calibration was due to overconfidence or underconfidence.
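
The trade-off can be illustrated with a Python sketch of a scalar partition of the quadratic score for 0/1 outcomes. It assumes the usual form in which the mean squared error equals an uncertainty term plus a calibration term minus a resolution term; the function name and the ten outcomes are hypothetical.

```python
from collections import defaultdict

def murphy_partition(responses, outcomes):
    """Return (uncertainty, calibration, resolution) for 0/1 outcomes.

    The mean of (r - x)^2 equals uncertainty + calibration - resolution.
    """
    n = len(responses)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for r, x in zip(responses, outcomes):
        groups[r].append(x)
    calibration = 0.0
    resolution = 0.0
    for r, xs in groups.items():
        c = sum(xs) / len(xs)                         # proportion true at response r
        calibration += len(xs) * (r - c) ** 2         # zero when r matches c everywhere
        resolution += len(xs) * (c - base_rate) ** 2  # zero when c never leaves the base rate
    return base_rate * (1 - base_rate), calibration / n, resolution / n

# Hypothetical outcomes with a base rate of .6, scored for two assessors.
outcomes = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
informed = [0.9, 0.9, 0.9, 0.9, 0.9, 0.3, 0.3, 0.3, 0.3, 0.3]
base_rate_only = [0.6] * 10
u1, cal1, res1 = murphy_partition(informed, outcomes)
u2, cal2, res2 = murphy_partition(base_rate_only, outcomes)
```

The assessor who always states the base rate has a calibration term of zero, which is the best possible, but also a resolution term of zero. The informed assessor is slightly miscalibrated yet earns a better overall score because her responses sort the items.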

Our previous finding that subjects tend to be overconfident with hard
items and underconfident with easy items adds to the dilemma one faces in
planning a training experiment. Those data suggest that one might have to
train subjects in both hard and easy tasks, separately, to have any hope
that the training would generalize.

CONTINUOUS PROPOSITIONS: UNCERTAIN QUANTITIES

Continuous uncertain quantities can be proportions (What proportion
of students prefer Scotch to Bourbon?) or numbers (What is the shortest
distance from England to Australia?). Subjects are usually not asked to
draw the entire density function across the range of possible values.

The elicitation procedure most commonly used is some variation of the
fractile method. In this method, the subject is asked to give the median
of the distribution ("state a value such that the true value is equally
likely to fall above or below the value you state"), and then several other
fractiles. For example, for the .01 fractile the subject would be asked
to state a value such that there is only 1 chance in 100 that the true value
is smaller than the stated value. In one variant called the tertile method,
the subject is not asked the median. He is asked to state two values
(the .33 and .67 fractiles) such that the entire range is divided into three
equally likely sections.

The most common calibration analysis is to calculate the interquartile
index, which is the percent of items for which the true value falls inside
the interquartile range (i.e., larger than the value associated with the
25th fractile, but smaller than the value associated with the 75th fractile),
and to calculate the "surprise index," which is the percent of true values
that fall outside the most extreme fractiles assessed. The perfectly
calibrated person will, in the long run, have an interquartile index of 50.
When the most extreme fractiles assessed are .01 and .99, then the perfectly
calibrated person will have a surprise index of 2.
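
Both indices are simple counts, as the following Python sketch shows. The data layout (one dictionary of fractile values per uncertain quantity), the function name, and the four example quantities are hypothetical.

```python
def fractile_indices(assessments, true_values):
    """Return (interquartile_index, surprise_index) as percentages.

    assessments: one dict per uncertain quantity, mapping fractile -> assessed value;
                 each dict must include the .25 and .75 fractiles.
    true_values: the true value of each quantity, in the same order.
    """
    inside = 0
    surprises = 0
    for fractiles, truth in zip(assessments, true_values):
        # Inside the interquartile range: above the .25 and below the .75 fractile.
        if fractiles[0.25] < truth < fractiles[0.75]:
            inside += 1
        # Surprise: the true value lies outside the most extreme fractiles assessed.
        lowest = fractiles[min(fractiles)]
        highest = fractiles[max(fractiles)]
        if truth < lowest or truth > highest:
            surprises += 1
    n = len(true_values)
    return 100.0 * inside / n, 100.0 * surprises / n

# Hypothetical assessor who gave the same five fractiles for four quantities.
one_assessment = {0.01: 10, 0.25: 20, 0.5: 30, 0.75: 40, 0.99: 50}
assessments = [one_assessment] * 4
true_values = [35, 55, 12, 25]
interquartile_index, surprise_index = fractile_indices(assessments, true_values)
```

In this made-up example the interquartile index happens to equal the ideal value of 50, while the surprise index of 25 is far above the 2 expected of a perfectly calibrated assessor using the .01 and .99 fractiles.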


The impetus for investigating the calibration of probability density
functions came from an unpublished paper by Alpert and Raiffa (1969),
surely the most referenced rough draft in the literature of decision making.
Alpert and Raiffa worked with four groups of subjects, all students enrolled
in courses given by the Harvard Business School, and all familiar with the
fundamentals of decision analysis. In their first experiment, all subjects
assessed five fractiles, three of which were .25, .50, and .75. The extreme
fractiles were, however, different for the different subgroups: .01 and .99
(Group A); .001 and .999 (Group B); "the minimum possible value" and "the
maximum possible value" (Group C); and "astonishingly low" and "astonishingly
high" (Group D). The interquartile and surprise indices for these four
subgroups are shown in Table 1. Alpert and Raiffa, discouraged by the
enormous number of surprises, then ran three additional groups who, after
assessing 10 uncertain quantities, received feedback in the form of an
extended report and explanation of the results, along with perorations that
in the future the subjects should "Spread Those Extreme Fractiles!" (p. 13).
The subjects then responded to 10 new uncertain quantities. Results before
and after training are shown in Table 1. All groups showed some improvement
with training. The greatest changes were shown by Group 4, the only group
of subjects who were not exclusively from the Harvard Business School, but
were enrolled in a decision analysis course designed for students from other
departments.

Alpert and Raiffa experimented with fitting a beta function to the
.25, .50, and .75 fractiles for a few subjects' responses to proportion
questions (e.g., what proportion of students answering this questionnaire
prefer Bourbon to Scotch?). The extreme fractiles of the fitted beta,
rather than those the subjects actually gave, were used to compute the
surprise index. This technique led to no improvement, suggesting that
the problem does not reside solely in subjects' inability to give
sufficiently extreme .01 and .99 fractiles, but in their .25 and .75
fractiles as well.


TABLE 1

Calibration Summary for Continuous Items:
Percent of True Values Falling Within Interquartile Range
and Outside the Extreme Fractiles

                                                   Interquartile
                                                     Index(b)      Surprise Index
                                             N(a)    Observed     Observed   Ideal

Alpert & Raiffa (1969)
  Group 1-A (.01, .99)                        880      }             46        2
  Group 1-B (.001, .999)                      500      } 33          40       .2
  Group 1-C ("min" & "max")                   700      }             47        ?
  Group 1-D ("astonishingly high/low")        700      }             38        ?
  Groups 2 & 3  Before                       1670       33           39        2
                After                        1670       44           23        2
  Group 4       Before                        600       36           21        2
                After                         600       43            9        2

Hession & McCarthy (1974)                    2035       25           47        2

Selvidge (1975)
  Five Fractiles                              400       56           10        2
  Seven Fractiles (incl. .1 & .9)             520       50            7        2

Schaefer & Borcherding (1973)
  1st Day, Fractiles                          396       23           39        2
  4th Day, Fractiles                          396       38           12        2
  1st Day, Hypothetical Sample                396       16           50        2
  4th Day, Hypothetical Sample                396       48            6        2

Pickhardt & Wallace (1974)
  Group 1, First Round                          ?       39           32        2
           Fifth Round                          ?       49           20        2
  Group 2, First Round                          ?       30           46        2
           Sixth Round                          ?       45           24        2

Pratt (Personal Communication)
  "Astonishingly high/low"                    175       37            5        ?

Brown (1973)                                  414       29           42        2

Seaver, von Winterfeldt, & Edwards (1975)
  Fractiles                                   160       42           34        2
  Odds-Fractiles                              160       53           24        2
  Probabilities                               180       57            5        2
  Odds                                        180       47            5        2
  Log Odds                                    140       31           20        2

Murphy & Winkler (1974)
  Extremes were .125 & .875                   132       45           27       25

Murphy & Winkler (this volume)
  Extremes were .125 & .875                   432       54           21       25

Staël von Holstein (1971)                    1269       27           30        2

(a) N is the total number of assessed distributions.

(b) The ideal percent of events falling within the interquartile range is 50, for
all experiments except Brown (1973). He elicited the .30 and .70 fractiles,
so the ideal is 40%.

Hession and McCarthy (1974) collected data comparable to Alpert and
Raiffa's first session, using 55 uncertain quantities and 37 graduate students
as subjects. In their instructions, they urged subjects to make certain
that the interval between the .25 fractile and the .75 fractile did indeed
capture half of the probability. "Later discussion with individual subjects
made it clear that this consistency check resulted in most cases in a
readjustment, decreasing the interquartile range originally assessed" (p. 7),
thus making matters worse! This instructional emphasis, not used by Alpert
and Raiffa, may explain why Hession and McCarthy's subjects were so badly
calibrated, as shown in Table 1.

Hession and McCarthy also gave their subjects a number of "personality"
tests they thought might be related to individual differences in calibration:
the F (Authoritarian) Scale, the Dogmatism Scale, the Gough-Sanford Rigidity
Scale, Pettigrew's Category-Width Scale, and a group-administered intelligence
scale. The correlations of these tests with the interquartile index and the
surprise index across subjects were mostly quite low, although the F scale
showed a hint of a relationship with calibration, correlating -.31 with the
interquartile score and +.47 with the surprise score (N = 28).

Selvidge (1975) extended Alpert and Raiffa's work by first asking
subjects four questions about themselves (e.g., do you prefer Scotch or
Bourbon?). The responses were then used to find the true answer for what
we will call "group-generated" uncertain quantities (e.g., how many of the
500 students answering the questionnaire preferred Scotch to Bourbon?). One
group gave five fractiles, .01, .25, .5, .75, and .99. Another group gave
those five plus two others, .1 and .9. As shown in Table 1, the group
with two additional fractiles did better. These results are not as different
from the results of Alpert and Raiffa as they appear. Two of Alpert and
Raiffa's uncertain quantities were group-generated proportions which were
similar to Selvidge's items. On these two items only, Alpert and Raiffa
found 58% in the interquartile range and 17% surprises. These results are
much more similar to Selvidge's results than were their results for the entire
10-item set. Selvidge also reported surprise indices of 10% for extremes
of .01 and .99 and 24% for extremes of .1 and .9, using five fractiles.
Finally, when she asked subjects to give .25, .5 and .75 first, and then
to give .01 and .99, she got fewer surprises (8%) than when the order was
reversed (16%).

Schaefer and Borcherding (1973) explored the effects of training. They
ran 22 university student subjects for four sessions, using 18 group-generated
proportions per session. Each subject used two assessment techniques:
(1) the fractile method (.01, .125, .25, .5, .75, .875, .99), and (2) the
hypothetical sample method. In the latter method, subjects are asked to
state the sample size, n, and the number of successes, r, of a hypothetical
sample which best reflects their knowledge about the uncertain quantity.
The larger n is, the more certain they are of the true value of the
proportion. The ratio r/n reflects the mean of the distribution of their
uncertainty. Subjects had great difficulty with this method, despite

in the interquartile range) than the fractile method.

Pickhardt and Wallace (1974) replicated Alpert and Raiffa's findings,
with variations. Across several groups they reported 38 to 48% surprises
before feedback, and not less than 30% surprises after feedback. Two
variations, using or not using course grade credit as a reward, and using
or not using scoring rule feedback, made no difference in the number of
surprises. Pickhardt and Wallace also studied the effects of extended training.
Two groups of 18 and 30 subjects (number of uncertain quantities not
reported) responded for five and six sessions with calibration feedback
after every session. Modest improvement was found, as shown in Table 1.

Finally, Pickhardt and Wallace studied the effects of increasing
knowledge on calibration in the context of a realistic decision-making
exercise: a production simulation game called PROSIM. Thirty-two graduate
students each made 51 assessments during a simulated 17 "days" of production
scheduling. Each assessment concerned an event that would occur 1, 2 or 3
"days" hence. The closer the time of assessment to the time of the event, the
more the subject knew about the event. This increased information did
affect calibration: there were 32% surprises with 3-day lags, 24% with 2-day
lags, and 7% with 1-day lags. No improvement was observed over the 17 "days"
of the simulation.

Pratt* asked a single expert to predict movie attendance for 175
movies or double features shown in two local theaters over a period of
more than one year. The expert assessed the median, quartiles, and
"astonishingly high" and "astonishingly low" values. As shown in Table 1,
the interquartile range tended to be too small. Despite the fact that the
expert received outcome feedback throughout the experiment, the only evidence
of improvement in calibration over time came in the first few days.

*J. W. Pratt, personal communication, October, 1975.

Brown (1973) reported calibration results for 31 subjects responding
to 14 uncertain quantities with fractiles .01, .10, .30, .50, .70, .90, and
.99. The results, shown in Table 1, are particularly discouraging, because
each question was accompanied by extensive historical data (e.g., for "Where
will the consumer price index stand in December, 1970?", subjects were given
the consumer price index for every quarter between March, 1962, and June,
1970). For 11 of the questions, had the subjects given the historical
minimum as their .01 fractile and the historical maximum as their .99 fractile,
they would have had no surprises at all. The other three questions showed
strictly increasing or strictly decreasing histories, and the true value was
close to any simple approximation of the historical trend. The subjects
must have been putting a large emphasis on their own erroneous knowledge to
have given distributions so tight as to produce 42% surprises.

Brown also reported unpublished data of Norman Dalkey and Bernice
Brown, who elicited quartile assessments for uncertain quantities and found
that, for 1,218 cases, 31% of the true answers fell inside the interquartile
range.

Seaver, von Winterfeldt, and Edwards (1975) studied the effects of
five different response modes on calibration. Two groups used the fractile
method, responding in units of the uncertain quantity to either fractiles
(.01, .25, .50, .75, .99) or the odds equivalents of those fractiles
(1:99, 1:3, 1:1, 3:1, 99:1). Three other groups responded with probabilities,
odds, or odds on a log-odds scale to one-alternative questions which specified
a particular value of the uncertain quantity (e.g., what is the probability
that the population of Canada in 1973 exceeded 25 million?). Five such
questions were given for each uncertain quantity. For each group, seven to
nine subjects, undergraduate and graduate students, responded to 20 uncertain
quantities. As shown in Table 1, the groups giving probabilistic and odds
responses had distinctly better surprise indices than those using the fractile
method. The log odds response mode did not work out well.

Four experiments used weather forecasters for subjects. In two
experiments Murphy and Winkler (1974; and in press), using the variable-
width, fixed-probability parallel to the earlier described fixed-width,
variable-probability experiment (which we analyzed as a discrete task),
asked subjects to give five fractiles (.125, .25, .5, .75, .875) for
tomorrow's high temperature. The results, shown in Table 1, indicate
excellent calibration. These subjects had fewer surprises in the extreme
25% of the distribution than did most of Alpert and Raiffa's subjects in
the extreme 2%! Murphy and Winkler found that the five subjects in the two
experiments who used the variable-width technique were better calibrated than
the four subjects using the fixed-width technique. Pitz (1974), however,
using a within-subject design with 44 college-student subjects, reported that
the fractile technique led to worse calibration than the fixed-width
technique, as did Seaver, von Winterfeldt and Edwards (1975).

Peterson, Snapper and Murphy (1972) asked for only three fractiles
(.25, .5, and .75) for tomorrow's high temperature. Of 55 events, 51%
fell inside the interquartile range, 16% fell on one of the boundaries,
and 33% fell outside. This bit of data contains no evidence of poor
calibration.

Staël von Holstein (1971) used three fixed-interval tasks: average
temperature tomorrow and the next day (dividing the entire response range
into 8 categories), average temperature four and five days from now (8
categories), and total amount of rain in the next five days (4 categories).
From each set of responses (4 or 8 probabilities summing to 1.0), he
estimated the underlying cumulative density function. He then combined
the 1,269 functions given by 28 participants. He reported an undue number
of surprises: 25% of the true answers fell below the inferred .07 fractile,
and 25% fell above the .79 fractile. Using the group cumulative density
function shown in his paper, we have estimated the surprise and interquartile
indices (see Table 1). In contrast to the studies by Murphy and Winkler
and by Peterson, Snapper and Murphy, these weather forecasters were quite
poorly calibrated. Staël von Holstein's task was essentially similar to
Murphy and Winkler's (1974) fixed-interval task. We have reviewed the
former here and the latter in the section on discrete tasks simply because
that is the way the authors summarized their data.

Barclay and Peterson (1973) compared the tertile method (i.e., the
fractiles .33 and .67) with a "point" method in which the assessor is
asked to give the modal value of the uncertain quantity, and then two
values, one above and one below the mode, each of which is half as likely
to occur as is the modal value (i.e., points for which the probability
density function is half as high as at the mode). Using 10 almanac
questions as uncertain quantities and 70 students at the Defense Intelligence
School in a within-subject design, they found for the tertile method that 29%
(rather than 33%) of the true answers fell in the central interval. For
the point method, only 39% fell between the two half-probable points, whereas
for most distributions, approximately 75% of the density falls between these
points.
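The figure of roughly 75% can be checked for one common case. The sketch below assumes a normal distribution, which the text does not specify; the function name is hypothetical. The density falls to half its modal height at a distance d from the mode where exp(-d^2 / (2 sigma^2)) = 1/2, so d = sigma * sqrt(2 ln 2).

```python
import math
from statistics import NormalDist


def mass_between_half_height_points(sigma: float = 1.0) -> float:
    """Probability mass between the two points where a normal density
    is half as high as at its mode (assumes a normal distribution)."""
    # Solve exp(-d**2 / (2 * sigma**2)) = 1/2 for the distance d from the mode.
    d = sigma * math.sqrt(2.0 * math.log(2.0))
    dist = NormalDist(mu=0.0, sigma=sigma)
    return dist.cdf(d) - dist.cdf(-d)


print(round(mass_between_half_height_points(), 3))  # about 0.761
```

Under this assumption about 76% of the density lies between the half-height points, whatever the value of sigma, which agrees with the approximate 75% quoted above. Other distribution shapes would give somewhat different values.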


Pitz (1974) reported several results using the tertile method. For 19
subjects estimating the populations of 23 countries, he found only 16% of
the true values falling inside the central 33 percentile. He called this
effect "hyperprecision." In another experiment he varied the items
according to the depth and richness of knowledge he presumed his subjects
to have. With populations of countries (low knowledge) he found 23% of
the true values in the central third; with heights of well-known buildings
(middling knowledge), 27%; and with ages of famous people (high knowledge),
47%, the last being well above the expected 33%. In yet another study, he
asked six subjects to assess tertiles, and a few days later to choose among
bets based on their own tertile values. He found a strong preference for
bets involving the central region, just the reverse of what their too-tight
intervals should lead them to. Pitz suggested that the point estimate (the
most likely value of the quantity) was over-controlling their choices.

The overwhelming evidence from research on uncertain quantities is
that people's probability distributions tend to be too tight. The assessment
of extreme fractiles is particularly prone to bias. Training improves
calibration somewhat. Experts sometimes perform well (Murphy and Winkler,
1974, in press; Peterson, et al., 1972), sometimes not (Staël von Holstein,
1971). There is only scattered evidence that difficulty is related to
calibration for continuous propositions. Pitz (1974) found such an effect,
and Pickhardt and Wallace's (1974) finding that 1-day lags led to fewer
surprises than 3-day lags in their simulation game is relevant here. Several
studies (e.g., Barclay and Peterson, 1973; Murphy and Winkler, 1974) have
reported a correlation between the spread of the assessed distribution and
the absolute difference between the assessed median and the true answer,
indicating that subjects do have a partial sensitivity to how
much they do or do not know. This finding parallels the finding, with discrete
propositions, of a correlation between percent correct and mean response.
Pratt's⁹ expert showed no such correlation.

⁹J. W. Pratt, personal communication, November 13, 1975.

DISCUSSION

Why should an assessor worry about being well calibrated? Von Winterfeldt
and Edwards (1973) have shown that, in most real-world decision problems,
fairly large errors make little difference in the expected gain; "A
suboptimal choice does not seriously hurt the decision maker as long as
the alternative selected is not grossly away from the optimum" (p. 1).

We can see at least two types of situations in which calibration does make
a difference. First, in a two-alternative situation, the payoff function
can be quite steep in the crucial region. Suppose your doctor must decide
the probability that you have condition A, and should receive treatment A,
versus having condition B and receiving treatment B. Suppose that the
utilities are such that treatment A is better if the probability that you
have condition A is > .4, as shown in Figure 15. If the doctor assesses
the probability that you have A as p(A) = .45, but is poorly calibrated, so
that he should have said .35, then he would treat you for A instead of B
and you would lose quite a chunk of expected utility. Real-life utility
functions of just this type are shown in Fryback (1974).
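The size of such a loss can be illustrated with a small expected-utility calculation. The utilities below are hypothetical, chosen only so that the two treatments break even at p(A) = .4; they are not taken from Figure 15 or from Fryback (1974).

```python
# Hypothetical utilities, indexed by (treatment given, condition actually present).
# They are chosen so that both treatments have equal expected utility at p(A) = .4.
UTILITY = {
    ("A", "A"): 1.0, ("A", "B"): 0.4,
    ("B", "A"): 0.1, ("B", "B"): 1.0,
}


def expected_utility(treatment: str, p_a: float) -> float:
    """Expected utility of a treatment when condition A has probability p_a."""
    return p_a * UTILITY[(treatment, "A")] + (1.0 - p_a) * UTILITY[(treatment, "B")]


def best_treatment(p_a: float) -> str:
    """Treatment with the higher expected utility at probability p_a."""
    return max(("A", "B"), key=lambda t: expected_utility(t, p_a))


def loss_from_miscalibration(stated: float, calibrated: float) -> float:
    """Expected utility given up by acting on the stated probability
    when the calibrated probability is the one that actually applies."""
    chosen = best_treatment(stated)
    optimal = best_treatment(calibrated)
    return expected_utility(optimal, calibrated) - expected_utility(chosen, calibrated)


print(best_treatment(0.45), best_treatment(0.35))      # A B
print(round(loss_from_miscalibration(0.45, 0.35), 3))  # 0.075
```

With these made-up utilities, a stated probability of .45 selects treatment A, although at the calibrated value of .35 treatment B has the higher expected utility. The steeper the two utility lines cross, the larger the loss for the same calibration error.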

Secondly, even if the expected loss function for poor calibration is
quite flat, the payoffs may be so large, and the errors so large, that
the expected loss looms large. Weatherwax (1975), in critiquing the $3
million Rasmussen report on nuclear power safety (AEC, 1974), noted that
"at each level of the analysis a log-normal distribution of failure rate
data was assumed with 5 and 95 percentile limits defined" (p. 31). The
research reviewed here suggests that distributions built from assessments
of the .05 and .95 fractiles may be grossly biased. If such assessments
are made at several levels of an analysis, with each assessed distribution
being too narrow, the errors will not cancel each other, but will compound.
And because the costs of nuclear disasters are large, the expected loss
from such errors could be enormous.
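A rough sketch of how such errors could compound is given below. It rests on assumptions that are not in the reports cited: the levels are treated as independent log-normal factors that multiply, every level has the same assessed 5 and 95 percentile limits, and the true spread on the log scale is a fixed multiple (widen) of the assessed spread. The function names are hypothetical.

```python
import math
from statistics import NormalDist

# Standard normal deviate for the 95th percentile (about 1.645).
Z95 = NormalDist().inv_cdf(0.95)


def log_sd_from_limits(p05: float, p95: float) -> float:
    """Log-scale standard deviation of a log-normal with the given 5%/95% limits."""
    return (math.log(p95) - math.log(p05)) / (2.0 * Z95)


def upper_limit_understatement(p05: float, p95: float,
                               widen: float, levels: int) -> float:
    """Factor by which the 95th percentile of a product of `levels`
    independent log-normal factors is understated when each factor's true
    log-scale spread is `widen` times the assessed spread (an assumption)."""
    s = log_sd_from_limits(p05, p95)
    assessed = Z95 * s * math.sqrt(levels)        # log of assessed 95% limit / median
    actual = Z95 * widen * s * math.sqrt(levels)  # same quantity with the true spread
    return math.exp(actual - assessed)


for k in (1, 4, 9):
    print(k, round(upper_limit_understatement(0.1, 10.0, 1.5, k), 2))
# 1 3.16
# 4 10.0
# 9 31.62
```

In this illustration each assessed interval runs from 0.1 to 10 times the median and is assumed to be too narrow by half on the log scale. The upper limit of the combined quantity is then understated by a factor of about 3 for one level, 10 for four levels, and 32 for nine levels, so the understatement grows with the number of levels rather than averaging out.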

If proper calibration is important, how can it be achieved? One way
is to externally recalibrate the assessments people make. External
recalibration consists of collecting a set of assessments for items with
known answers, fitting a model to the data, and substituting, in future
assessments, the response predicted from the model for the response
given by the assessor. The technical difficulties confronting recalibration
are substantial. When eliciting the assessments to be modeled, one would
have to be careful not to give the assessors any more feedback than they
normally receive, for fear of their changing their calibration as it is
being measured. As Savage (1971) pointed out, ". . . you might discover
with experience that your expert is optimistic or pessimistic in some respect
and therefore temper his judgments. Should he suspect you of this, however,
you and he may well be on the escalator to perdition" (p. 796). One
would also have to be quite confident that the real world matches, in
difficulty, the known world on which their calibration is measured.
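The recalibration procedure described above can be sketched as follows. The choice of model is an assumption made only for illustration: a straight line fitted by least squares to 0/1 outcomes and clipped to the unit interval. The text does not say what model should be used, and the function names and data are hypothetical.

```python
from typing import Callable, Sequence


def fit_recalibration(stated: Sequence[float],
                      outcomes: Sequence[bool]) -> Callable[[float], float]:
    """Fit a least-squares line predicting outcome (1 = true, 0 = false)
    from the stated probability, and return a recalibration function.
    The linear form is an illustrative assumption, not a recommended model."""
    n = len(stated)
    ys = [1.0 if o else 0.0 for o in outcomes]
    mean_x = sum(stated) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in stated)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(stated, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def recalibrate(p: float) -> float:
        # Substitute the model's prediction for the assessor's response,
        # clipped so that the result is still between 0 and 1.
        return min(1.0, max(0.0, intercept + slope * p))

    return recalibrate


# Made-up calibration set: responses of .5 were right half the time,
# responses of 1.0 were right three times out of four.
stated = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]
outcomes = [True, True, False, False, True, True, True, False]

recalibrate = fit_recalibration(stated, outcomes)
print(round(recalibrate(0.8), 2))  # 0.65
print(round(recalibrate(1.0), 2))  # 0.75
```

A future response of .8 from this hypothetical assessor would be replaced by .65. The sketch leaves aside the practical difficulties noted above, such as feedback effects during the calibration set and the match in difficulty between that set and the real task.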

The theoretical objections to external recalibration may be even more
serious than the practical objections. An assessor who consistently
follows the axioms of probability theory can still be badly calibrated.
The numbers produced by a recalibration process on such an assessor will
not, in general, follow those axioms (for example, the numbers associated
with mutually exclusive and exhaustive events will not always sum to one, nor
will it be generally true that P(A) · P(B) = P(A,B) for independent events);


The most striking aspect of the literature reviewed here is its
"dust-bowl empiricism." Psychological theory is largely absent, either
as motivation for the research or as explanation of the results. Much
of the research seems motivated by simple questions beginning "What would
happen if we . . . ?". Much of the interest in the research is in its potential
applications. If people are going to have to assess probabilities in
the course of making important future decisions, let us figure out the best
way to do it. We cannot help feeling that a better understanding of
the psychological underpinnings of these findings would speed the solution
to these applied problems.

Not all authors have avoided theorizing. Tversky and Kahneman (1974)
and Slovic (1972) believe that, as a result of limited information-processing
abilities, people adopt simplifying rules or heuristics. Although generally
quite useful, these heuristics can lead to severe and systematic errors.
For example, the tendency of people to give unduly tight distributions
when assessing uncertain quantities could reflect the heuristic called
"anchoring and adjustment." When asked about an uncertain quantity, one
naturally thinks first of a point estimate, the most likely value. This
value then serves as an anchor. To give the 25th or 75th percentile, one
must adjust this anchor downwards or upwards. But the anchor has such a
dominating influence that the adjustment is insufficient; hence the fractiles
are too close together, yielding overconfidence. When, however, the
experimenter provides a value, and the subject must supply a probability,
the natural anchor is the first probability one thinks of. If that first
probability thought of is .5 (reflecting initial uncertainty about whether
the true value is above or below the value provided), then insufficient
adjustment from this natural anchor will result in underconfidence.
Tversky and Kahneman report data supporting this view. Pitz's (1974) data
in Figure 11, however, show overconfidence when a single value of the
uncertain quantity is given to the subject. If these subjects were using
the anchoring and adjustment heuristic, .5 was not their anchor.

Pitz (1974), too, believes that people's information-processing
capacity and working memory capacity are limited. He suggests that people
set up complex problems serially, working through a portion at a time. To
reduce cognitive strain, people ignore the uncertainty in their solutions to
the early portions of the problem in order to reduce the complexity of the
calculations in later portions. This could lead to too-tight distributions
and overconfidence. Pitz also suggests that one way people estimate their
own uncertainty is by seeing how many different ways they can arrive at an
answer, that is, how many different serial solutions they can construct. If
many are found, people will recognize their own uncertainty; if few are found,
they will not. The richer the knowledge base from which to build alternative
structures, the less the tendency towards overconfidence. This was the
reasoning that led Pitz to gather the data of Figure 11, which support his
hypothesis.

These considerations are not full-fledged theories, but they may help
us to gain understanding of how people think probabilistically. Another
notion that may be helpful is coding. How do we code in our minds the
outcomes we receive? Surely not the way we have coded, on paper, the data
needed to plot a calibration curve.

A person could conceivably learn whether his judgments are
externally calibrated by keeping a tally of the proportion
of events that actually occur among those to which he
assigns the same probability. However, it is not natural
to group events by their judged probability. In the
absence of such grouping it is impossible for an individual
to discover, for example, that only 50 percent of the
predictions to which he has assigned a probability of .9
or higher actually came true. (Tversky & Kahneman, 1974, p. 1130)
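The tally described in this passage is simple to keep on paper or in code, even if it is not natural to keep in one's head. The following is a minimal sketch; the function name and the sample judgments are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def calibration_tally(judgments: Iterable[Tuple[float, bool]]) -> Dict[float, float]:
    """Group (assigned probability, event occurred) pairs by assigned
    probability and return the proportion of events that occurred in each group."""
    counts = defaultdict(lambda: [0, 0])  # probability -> [occurred, total]
    for probability, occurred in judgments:
        counts[probability][0] += int(occurred)
        counts[probability][1] += 1
    return {p: hits / total for p, (hits, total) in sorted(counts.items())}


# Made-up record of past judgments.
record = [(0.9, True), (0.9, False), (0.6, True), (0.9, False), (0.9, True)]

print(calibration_tally(record))  # {0.6: 1.0, 0.9: 0.5}
```

In this made-up record only half of the events assigned .9 occurred, the kind of discrepancy that, according to the passage, goes unnoticed when events are not grouped by judged probability.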

In addition, as Fischhoff and Beyth (1975) found, even when subjects
were forced to assess probabilities, they later altered their memory of
these probabilities. Specifically, they remembered assigning higher
probabilities than they actually had to events which later occurred and
lower probabilities than they had to events which did not occur. To the
extent that we do code events by probabilistic categories, we bias our
coding towards overconfidence. "The judge who is insufficiently aware
of the surprises the past held for him, and of the need to improve his
performance, seems likely to continue being surprised by what happens in
the future" (Fischhoff & Beyth, 1975, p. 15).

In conclusion, it seems appropriate to summarize what we know about
calibration. We may characterize our knowledge as falling into one of
three states: understanding, confusion, ignorance.

Understanding reigns when we have extensive evidence pointing at a
common conclusion which any theory must accommodate. Understandings are,
as might be expected, fairly scarce. One is that, as a result of subjects'
failure to discriminate different levels of uncertainty adequately,
different calibration curves emerge for tests with different levels of
difficulty. A second conclusion is that the most common form of
miscalibration is overconfidence. Nearly all the data about uncertain
quantities point in this direction, as do the discrete-proposition data
for all but the easiest tasks. If overconfidence is further evidence of
a general tendency toward what Dawes (1976) calls "cognitive conceit," it
is crucial to understand its origins, limits and remedies. A third
and more optimistic conclusion is that calibration can be somewhat improved
by training.

Confusion reigns when studies of a given question point in contrary
directions or when we must put our faith in a single study using but one
of the many possible variations of experimental procedure and stimuli.
Consider for example the symmetry or asymmetry of the curves in different
full-range studies, or the contrary contrasts of the variable-width and
fixed-width methods of Pitz (1974) and Murphy and Winkler (1975), or
Hazard and Peterson's (1973) lonely finding that odds and probability
judgments have similar calibration curves.

One partial solution to the problem of divergent findings is to
increase our understanding of the sampling properties of calibration
curves. Some conflicting results may be attributable to sampling
variations. The second general solution (aside from collecting more
data) is to improve our theoretical conceptualization of probability
assessment tasks and of the factors which influence performance. Apparently
divergent findings may be explained by previously unnoted differences in
task characteristics such as difficulty level, instructions, or implicit
loss functions.

When ignorance reigns, it is the job of any theory to advance
interesting hypotheses and identify crucial issues. Even in lieu of
developed theories, it is still possible to raise many questions that
bear answering. What are the effects of varying instructions, e.g.,
ardently discouraging the use of .00 and 1.00? Are there any response
modes particularly conducive to calibrated judgments? Should one restrict
assessors to some fixed number of possible probability responses (say,
.5, .75, and .99) which reflects the number of meaningful discriminations
that they can make? What is the effect of the number of alternatives on
calibration? Are there individual differences in calibration and, if so,
what distinguishes well-calibrated judges? Holding task difficulty
constant, neither brains nor expertise appears to make much difference.

We have recently found that with a half-range, two-alternative task,
heavy reliance on the responses .50 and 1.00 (which might reflect lack
of effort or perceived inability to make finer distinctions) is not a
sign of inferior calibration. Other than task difficulty, what does make
a difference? Even without theoretical advances, we have some work to do
before reaching the bottom of empiricism's dust-bowl.
the long run, for all statements assigned a given probability (e.g., the
probability is .65 that "Romania will maintain its current relation with
People's China"), the proportion that is true is equal to the probability
assigned. For example, if you are well calibrated, then across all the many
occasions that you assign a probability of .8, in the long run 80% of them
should turn out to be true. If, instead, only 70% are true, you are not
well calibrated; you are overconfident. If 95% of them are true, you are
underconfident. In the last few years, there has developed an extensive
literature about calibration, reporting both laboratory and real-world
experiments. The present report reviews this literature, looking for
findings that can be used to improve decisions. Among the major findings
are the following:

1.  Weather forecasters, who typically have had several years of
experience in assessing probabilities, are well calibrated.

2.  Other experiments, using a wide variety of tasks and subjects,
show that people are generally quite poorly calibrated. In particular,
people act as though they can make much finer distinctions in their
degree of uncertainty than is actually the case.

3.  Overconfidence is found in most tasks; that is, people tend to
overestimate how much they know.

4.  Despite the abundant evidence that untutored assessors are
badly calibrated, there is little research showing how and how well
these deficiencies can be overcome through training.
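
The definition of calibration given before the list of findings can be sketched in a few lines of code. The sketch groups statements by the probability assigned to them and compares each stated probability with the proportion of statements that turned out true. The data, the function names and the tolerance of .02 are hypothetical and serve only as illustration; the labels follow the report's half-range example (probabilities of .5 or more), where fewer true statements than stated means overconfidence.

```python
from collections import defaultdict


def calibration_curve(assessments):
    """Map each stated probability to (number of items, proportion true).

    assessments is an iterable of (stated_probability, came_true) pairs.
    """
    groups = defaultdict(list)
    for stated, came_true in assessments:
        groups[stated].append(bool(came_true))
    return {p: (len(v), sum(v) / len(v)) for p, v in sorted(groups.items())}


def classify(stated, proportion_true, tolerance=0.02):
    """Label one point of the curve; the tolerance is an arbitrary
    illustrative choice, not a value from the literature."""
    if abs(proportion_true - stated) <= tolerance:
        return "well calibrated"
    if proportion_true < stated:
        return "overconfident"
    return "underconfident"


if __name__ == "__main__":
    # Hypothetical data: ten statements assigned .8, seven of them true.
    data = [(0.8, True)] * 7 + [(0.8, False)] * 3
    for stated, (n, prop) in calibration_curve(data).items():
        print(stated, n, prop, classify(stated, prop))
```

With these made-up data the assessor assigned .8 to ten statements of which 70% were true, so the sketch labels that point of the curve overconfident, matching the example in the text.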